Parallel clustering of high-dimensional social media data streams


Abstract

We introduce Cloud DIKW as an analysis environment supporting scientific discovery through integrated parallel batch and streaming processing, and apply it to one representative domain application: social media data stream clustering. Recent work demonstrated that high-quality clusters can be generated by representing the data points using high-dimensional vectors that reflect textual content and social network information. Due to the high cost of similarity computation, sequential implementations of even single-pass algorithms cannot keep up with the speed of real-world streams. This paper presents our efforts to meet the constraints of real-time social stream clustering through parallelization. We focus on two system-level issues. Most stream processing engines like Apache Storm organize distributed workers in the form of a directed acyclic graph, making it difficult to dynamically synchronize the state of parallel workers. We tackle this challenge by creating a separate synchronization channel using a pub-sub messaging system. Due to the sparsity of the high-dimensional vectors, the size of centroids grows quickly as new data points are assigned to the clusters. Traditional synchronization that directly broadcasts cluster centroids becomes too expensive and limits the scalability of the parallel algorithm. We address this problem by communicating only dynamic changes of the clusters rather than the whole centroid vectors. Our algorithm under Cloud DIKW can process the Twitter 10% data stream in real-time with 96-way parallelism. By natural improvements to Cloud DIKW, including advanced collective communication techniques developed in our Harp project, we will be able to process the full Twitter stream in real-time with 1000-way parallelism. Our use of powerful general software subsystems will enable many other applications that need integration of streaming and batch data analytics.
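
The idea of synchronizing only cluster changes, rather than whole centroid vectors, can be illustrated with a minimal sketch. This is not the authors' implementation. The names `Worker`, `cosine`, and `sync` are hypothetical, centroids are assumed to be running sums of sparse vectors stored as dictionaries, cosine similarity is assumed as the similarity measure, and a plain in-process function stands in for the pub-sub synchronization channel.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dimension: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector when taking the dot product
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Worker:
    """One parallel worker holding its own replica of the cluster centroids."""

    def __init__(self, centroids):
        # Assumption: a centroid is the running sum of its member vectors.
        self.centroids = {cid: dict(vec) for cid, vec in centroids.items()}
        self.pending = []  # (cluster id, point) changes not yet published

    def assign(self, point):
        """Pick the most similar cluster and record the change as a delta."""
        best = max(self.centroids, key=lambda cid: cosine(point, self.centroids[cid]))
        self.pending.append((best, point))
        return best

    def flush(self):
        """Hand over the deltas accumulated since the last synchronization."""
        deltas, self.pending = self.pending, []
        return deltas

    def apply(self, deltas):
        """Merge deltas into the local centroid replicas."""
        for cid, vec in deltas:
            centroid = self.centroids[cid]
            for k, w in vec.items():
                centroid[k] = centroid.get(k, 0.0) + w


def sync(workers):
    """Stand-in for the pub-sub channel: gather all deltas, deliver them to all."""
    deltas = [d for w in workers for d in w.flush()]
    for w in workers:
        w.apply(deltas)
    return deltas


initial = {0: {"cat": 1.0}, 1: {"dog": 1.0}}
workers = [Worker(initial), Worker(initial)]
first = workers[0].assign({"cat": 1.0, "pet": 1.0})
second = workers[1].assign({"dog": 2.0})
published = sync(workers)
```

In this sketch each message carries only the few non-zero dimensions of the newly assigned points, so its size does not grow with the centroids, and every replica ends up identical after each `sync` call.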

Bibliographic details

Authors
Gao, Xiaoming; Ferrara, Emilio; Qiu, Judy
Year: 2015
Original format: PDF

Similar literature

1. Clustering High-Dimensional Data Stream: A Survey on Subspace Clustering, Projected Clustering on Bioinformatics Applications (Advanced Science, Engineering and Medicine, Vol. 8(9), pp. 749–757 (2016)) [J]. Ali Baghernia, Hamid Pavin, Miresmail Mirnabibaboli. Advanced Science, Engineering and Medicine. 2017, No. 7.
2. ERRATUM: Clustering High-Dimensional Data Stream: A Survey on Subspace Clustering, Projected Clustering on Bioinformatics Applications [J]. Ali Baghernia, Hamid Pavin, Miresmail Mirnabibaboli. Advanced Science, Engineering and Medicine. 2017, No. 7.
3. Clustering High-Dimensional Data Stream: A Survey on Subspace Clustering, Projected Clustering on Bioinformatics Applications [J]. Ali Baghernia, Hamid Pavin, Miresmail Mirnabibaboli. Advanced Science, Engineering and Medicine. 2016, No. 9.
4. Parallel Clustering of High-Dimensional Social Media Data Streams [C]. Xiaoming Gao, Emilio Ferrara, Judy Qiu. IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 2015.
5. Stream-Dashboard: A big data stream clustering framework with applications to social media streams [D]. Basheer Hawwash. 2013.
6. Robust High-dimensional Bioinformatics Data Streams Mining by ODR-ioVFDT [O]. Dantong Wang, Simon Fong, Raymond K. Wong.
7. Analysis of clinical flow cytometric immunophenotyping data by clustering on statistical manifolds: Treating flow cytometry data as high-dimensional objects [O]. Finn WG, Carter KM, Raich R, Stoolman LM, Hero AO. Cytometry Part B. 2009; 76B: 1–7.
